Overview
This chapter summarizes commonly used advanced analysis methods for single-cell transcriptomics, aiming to help researchers extract richer, biologically interpretable conclusions. We adopt a practical approach, highlighting the definition, applicable scenarios, SeekSoulOnline parameters, result interpretation points, and application examples for each method, while providing common pitfalls and optimization suggestions to facilitate data analysis on SeekSoulOnline.
TIP
How to Read This Section
- This section is quite long. We recommend using the table of contents on the right side of the page to quickly jump to topics of interest and expand specific content as needed.
Pseudotime Analysis
1. Definition of Pseudotime Analysis
Pseudotime analysis is a class of computational methods used to infer the relative progression of cells along a dynamic biological process (such as development, differentiation, disease progression, or response to stimuli) from single-cell transcriptome data. By measuring transcriptomic similarity between cells, it maps discrete cell samples to a low-dimensional space and orders them according to their intrinsic continuity, thereby assigning each cell a relative scale (pseudotime value) representing the "degree of process progression." It is important to emphasize that pseudotime is not equivalent to real time, but rather reflects the relative position of cell states along an assumed trajectory.
Typical analysis approaches include: selecting ordering genes that represent the process, learning or fitting principal graph structures (such as tree or graph models) in low-dimensional space, and projecting cells onto this principal graph to calculate pseudotime values and identify branch points and terminal states.
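The "order cells along a graph" idea above can be illustrated with a minimal sketch. It is not the Monocle algorithm: it assumes a toy 2-D embedding, a hand-picked root cell, and uses shortest-path distance on a k-nearest-neighbour graph as a stand-in for projection onto a learned principal graph. The names `knn_graph` and `pseudotime` are hypothetical.

```python
import heapq
import math

def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def pseudotime(points, root, k=3):
    """Shortest-path distance from the root cell, rescaled to [0, 1]."""
    graph = knn_graph(points, k)
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:  # Dijkstra's algorithm
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    longest = max(dist.values())
    # Cells not reachable from the root get an infinite pseudotime.
    return [dist.get(i, math.inf) / longest for i in range(len(points))]

# Toy embedding (assumption): 20 cells spread along a half circle.
cells = [(math.cos(math.pi * t / 19), math.sin(math.pi * t / 19)) for t in range(20)]
pt = pseudotime(cells, root=0, k=3)
```

The resulting values only encode relative position along the path from the chosen root; choosing a different root reverses or reshuffles the ordering, which is why root selection matters in practice.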
2. Significance of Pseudotime Analysis
- Overcoming Cell Asynchrony: Single-cell samples typically consist of a mixture of cells at different developmental or response stages. Pseudotime analysis can reorder these asynchronous cells according to their intrinsic progression, revealing continuous biological changes.
- Reconstructing Continuous Processes: Compared to discrete clustering, pseudotime can display continuous transition trajectories between cell states, helping to understand the gradual changes from initial to terminal states.
- Identifying Transitional and Rare Cells: It can detect intermediate-state cells or rare transitioning cells between two stable states, which often carry critical fate-determining information.
- Discovering Dynamic Regulatory Genes and Modules: By analyzing gene expression changes along pseudotime, key regulatory factors that are upregulated or downregulated at different stages of the process, as well as co-varying gene modules, can be identified, providing candidate gene sets for subsequent mechanistic studies.
- Revealing Branch Points and Fate Decisions: When trajectories contain branching structures, pseudotime analysis can locate fate bifurcation points and compare gene expression differences between branches, thereby proposing testable hypotheses about cell fate choices.
- Guiding Experimental Design and Multi-method Cross-validation: Pseudotime results can guide temporal sampling, intervention experiments, or lineage tracing validation experiments; they can also be combined with other methods (such as CytoTRACE based on transcriptional diversity or scVelo based on RNA velocity) to enhance the reliability of conclusions.
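As a rough illustration of finding genes that change along pseudotime, the sketch below ranks genes by their Spearman correlation with pseudotime. This is an assumed, simplified stand-in: dedicated tools typically fit smooth curves and test for significance, and the gene names and values here are made up.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical pseudotime values and expression of three genes in 7 cells.
pseudotime = [0.0, 0.1, 0.25, 0.4, 0.6, 0.8, 1.0]
expression = {
    "GENE_UP": [0.1, 0.3, 0.8, 1.5, 2.2, 3.0, 3.4],
    "GENE_DOWN": [4.0, 3.1, 2.5, 1.2, 0.9, 0.2, 0.0],
    "GENE_FLAT": [1.0, 1.2, 0.9, 1.1, 1.0, 1.2, 0.9],
}
trend = {gene: spearman(pseudotime, values) for gene, values in expression.items()}
# Genes with the largest absolute correlation change most consistently.
ranked = sorted(trend, key=lambda g: abs(trend[g]), reverse=True)
```

A monotonic correlation misses genes that peak transiently in the middle of a trajectory, so it is best treated as a first-pass screen.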
TIP
Pseudotime analysis is particularly suitable for studying biological questions with continuous change processes, such as:
- Stem cell differentiation and development
- Activation and exhaustion of immune cells
- Onset and progression of diseases (such as cancer)
- Cellular responses to drugs or environmental stimuli
3. What are the pseudotime analysis methods? How should I choose?
| Tool | Core Principle | Advantages | Disadvantages | Applicable Scenarios and Recommendations |
|---|---|---|---|---|
| Monocle 2 | Pseudotime + DDRTree algorithm. Through dimensionality reduction and graph learning, it constructs a minimum spanning tree of cells in low-dimensional space. It is a classic method for analyzing complex branching processes. | Stable and classic results: the branching trajectories are clear and have been validated in numerous publications. Mature ecosystem: many tutorials and published examples are available. | Computational performance: slow and memory-intensive on large datasets (>100,000 cells). Dependence on starting point: requires the user to specify the starting point of the trajectory. | First choice. When you want to clearly display how cells differentiate from one state into multiple terminal states, Monocle 2 is a reliable and validated choice. |
| Monocle 3 | Pseudotime + UMAP embedding. Learns graph structures directly on UMAP or other dimensionality reduction embeddings to infer trajectories, which is a simpler analysis strategy. | Fast and scalable: handles large datasets well, up to millions of cells. Good compatibility: compatible with mainstream ecosystems like Scanpy. | Still under development: the algorithms are iterating rapidly, and results for complex branching processes may be less stable and clear than those of Monocle 2. | An efficient alternative for large-scale datasets, or when the trajectory structure is relatively simple (such as linear or single branching). |
| CytoTRACE | Transcriptional diversity. Based on the core assumption that cells with higher differentiation potential express more genes, while more differentiated cells have more specialized expression patterns. | No starting point required: predicts cell differentiation potential without a user-specified root. Good at finding roots: very effective at determining the "root" (least differentiated cell cluster) of a trajectory. | Does not generate trajectory graphs: it mainly provides a ranking of differentiation potential (a numerical value) rather than a visualizable trajectory. | Strongly recommended when you are unsure which cell cluster is the starting point, or want to verify the starting point chosen for Monocle or other tools. |
| scVelo | RNA velocity. By quantifying the abundance of pre-mRNA (unspliced) and mature mRNA (spliced), it infers the instantaneous direction and speed of cell state transitions. | Predicts the "future": reveals the direction and speed of cell state transitions, providing dynamic information. Reveals cyclic processes: effective for depicting processes such as the cell cycle. | High data requirements: requires high-quality data that captures intronic reads effectively. Complex interpretation: velocity vector plots can be complex and require careful interpretation. | The best choice when you want to know the "direction" and "speed" of cell state transitions, not just the "path". Also suitable for exploring systems in dynamic equilibrium. |
Summary and Recommendations:
- Standard workflow: Monocle 2 (trajectory mapping) + CytoTRACE (root determination).
- Exploring direction and velocity: If data quality permits, use scVelo for deeper dynamic analysis.
- Large-scale datasets: Prioritize Monocle 3.
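The transcriptional-diversity assumption behind CytoTRACE can be sketched very roughly as "rank cells by the number of detected genes". The snippet below shows only that core intuition on made-up counts; the actual tool applies additional smoothing and gene selection steps that are not reproduced here, and `potential_scores` is a hypothetical helper.

```python
def detected_gene_counts(counts):
    """Number of genes with a non-zero count in each cell."""
    return [sum(1 for c in cell if c > 0) for cell in counts]

def potential_scores(counts):
    """Rank cells by detected genes, rescaled to [0, 1] (1 = most genes detected)."""
    n_genes = detected_gene_counts(counts)
    order = sorted(range(len(n_genes)), key=lambda i: n_genes[i])
    scores = [0.0] * len(counts)
    for rank, idx in enumerate(order):
        scores[idx] = rank / (len(counts) - 1)
    return scores

# Hypothetical raw counts: 4 cells x 6 genes.
cell_names = ["stem_like", "progenitor", "intermediate", "mature"]
counts = [
    [3, 1, 2, 1, 4, 1],  # 6 genes detected
    [5, 0, 2, 1, 0, 3],  # 4 genes detected
    [0, 0, 7, 2, 0, 1],  # 3 genes detected
    [0, 0, 9, 0, 0, 0],  # 1 gene detected
]
scores = dict(zip(cell_names, potential_scores(counts)))
```

Because gene counts also depend on sequencing depth and cell quality, a score like this should be compared within a dataset processed consistently rather than across batches.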
Cell-Cell Communication Analysis
1. Definition and Significance
Cell-cell communication analysis is used to infer signal exchange between cell populations through ligand–receptor (L–R) pairs and downstream signaling pathways from single-cell/spatial transcriptome data. Its goal is not only to identify potential L–R pairs, but also to quantify the "sender/receiver" roles and pathway strength between populations at the pathway and system levels, providing candidate signaling molecules and mechanistic hypotheses for subsequent experimental validation.
TIP
Cell-cell communication results are "predictive" rather than "causal" conclusions and must be combined with expression evidence, literature, and experimental validation (flow cytometry, in situ, or functional blocking experiments).
2. Common Tools and Core Differences
| Tool | Core Principle | Advantages | Disadvantages | Applicable Scenarios and Recommendations |
|---|---|---|---|---|
| CellChat (R) | Based on CellChatDB, aggregates multi-subunit and cofactor information, combines network analysis and pathway summarization to quantify communication strength and cell roles | Rich visualization (circle/chord/sankey, etc.), pathway summarization, and sender/receiver role analysis; suitable for system-level comparisons | Depends on prior database coverage; high computational resource requirements for large datasets; sensitive to low-expression/rare cells | First choice for system-level pathway comparison, sender/receiver role identification, and pathway prioritization |
| CellPhoneDB (Python/CLI) | Uses subunit minimum expression principle to represent complex expression, employs cell label permutation test for significance | Sensitive to complexes with conservative and rigorous statistical methods; mature database for human data | Officially focuses on HUMAN (other species require mapping); visualization and pathway summarization relatively limited | First choice for strict LR significance screening and conservative result validation in human data |
| NicheNet (R) | Integrates ligand-receptor, signal transduction, and TF→target gene networks, and uses network propagation to assess the regulatory potential of ligands on target genes | Provides mechanistic predictions of ligand→target genes and ligand activity scores; suitable for explaining differential expression in receiver cells | Does not directly provide inter-population sender/receiver strength measurements; typically requires differential genes as input | First choice when research focuses on whether ligands can explain differential genes in receiver cells and on downstream regulatory mechanisms |
Note: Differences in database coverage, complex annotation, and statistical strategies among tools can significantly impact detection results; cross-tool validation is recommended when reporting candidates.
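To make the "subunit minimum expression" and "label permutation" ideas in the CellPhoneDB row concrete, here is a minimal sketch on made-up data. It is an illustration of the principle rather than the CellPhoneDB implementation: the gene names are placeholders, and the `+1` correction in the p-value is a choice made for this example.

```python
import random
from statistics import mean

def lr_score(expr, labels, ligand, receptor_subunits, sender, receiver):
    """Average of the ligand mean in sender cells and the weakest
    receptor-subunit mean in receiver cells."""
    sender_cells = [i for i, lab in enumerate(labels) if lab == sender]
    receiver_cells = [i for i, lab in enumerate(labels) if lab == receiver]
    ligand_mean = mean(expr[ligand][i] for i in sender_cells)
    # A complex is only as expressed as its least expressed subunit.
    receptor_mean = min(mean(expr[g][i] for i in receiver_cells) for g in receptor_subunits)
    return (ligand_mean + receptor_mean) / 2

def permutation_pvalue(expr, labels, ligand, receptor_subunits, sender, receiver,
                       n_perm=1000, seed=0):
    """Shuffle cell labels to build a null distribution of the score."""
    rng = random.Random(seed)
    observed = lr_score(expr, labels, ligand, receptor_subunits, sender, receiver)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if lr_score(expr, shuffled, ligand, receptor_subunits, sender, receiver) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: 6 cells of cluster A followed by 6 cells of cluster B.
labels = ["A"] * 6 + ["B"] * 6
expr = {
    "LIG": [5, 6, 5, 7, 6, 5, 0, 0, 1, 0, 0, 0],
    "REC1": [0, 1, 0, 0, 0, 1, 4, 5, 4, 6, 5, 4],
    "REC2": [0, 0, 0, 1, 0, 0, 2, 3, 2, 2, 3, 2],
}
score, p_value = permutation_pvalue(expr, labels, "LIG", ["REC1", "REC2"], "A", "B")
```

Here the weaker subunit `REC2` limits the receptor side of the score, which is the behaviour that makes this strategy conservative for multi-subunit complexes.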
3. Priority Recommendations and Selection Guidance
- If the goal is system-level visualization and pathway comparison (including sender/receiver role analysis): Prioritize CellChat (excellent visualization and pathway summarization capabilities).
- If using human data and focusing on the statistical significance of ligand-receptor pairs: CellPhoneDB is a robust choice (emphasizes the subunit minimum expression strategy and permutation testing).
- If focusing on whether ligands can explain differential genes in receiver cells and on downstream target gene mechanisms: Use NicheNet to generate ligand→target gene candidate lists and activity scores.
- Comprehensive strategy (recommended): After software analysis, validate the predictions with wet-lab experiments or spatial data.
4. Common Pitfalls
WARNING
- Never rely solely on the "strength values" from a single tool to draw experimental conclusions; the scores reported by different methods measure different quantities and are not interchangeable.
- Avoid directly comparing values from different groups without unified annotation and preprocessing; such comparisons are highly prone to technical bias.
5. References and Further Reading (Recommended Links)
- CellChat official repository: https://github.com/sqjin/CellChat
- CellPhoneDB official repository: https://github.com/ventolab/CellphoneDB
- NicheNet official repository: https://github.com/saeyslab/nichenetr
Copy Number Variation (CNV) Analysis
1. Definition of Copy Number Variation
Copy Number Variation (CNV) is a type of genomic structural variation, referring to the increase or decrease in the copy number of DNA fragments longer than 1kb compared to the reference genome. CNV is an important source of genomic differences within and between species, and is closely related to the occurrence and development of many diseases. Especially in tumor research, CNV is an important feature of tumor heterogeneity.
In single-cell RNA sequencing (scRNA-seq) data, CNV can be indirectly inferred by analyzing changes in gene expression levels across consecutive genomic regions. If the overall gene expression level in a chromosomal region increases or decreases, it may indicate an increase or decrease in copy number in that region.
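The expression-based inference described above can be sketched as "centre on a normal reference, then smooth along genomic order". The example below is a minimal sketch under simplifying assumptions (genes already sorted by genomic position, log-scale values, a tiny made-up reference); it is not the InferCNV or CopyKAT implementation, and `cnv_signal` is a hypothetical helper.

```python
from statistics import mean

def moving_average(values, window):
    """Centred moving average; the window shrinks at the ends of the list."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(mean(values[lo:hi]))
    return out

def cnv_signal(cell_expr, reference_cells, window=3):
    """Reference-centred expression, smoothed over neighbouring genes."""
    ref_mean = [mean(col) for col in zip(*reference_cells)]
    relative = [x - r for x, r in zip(cell_expr, ref_mean)]
    return moving_average(relative, window)

# Hypothetical log-expression for 12 genes ordered by genomic position.
reference = [
    [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0],
    [1.0, 0.9, 1.1, 1.0, 1.0, 0.9, 1.1, 1.0, 1.0, 0.9, 1.1, 1.0],
]
# Genes 4 to 7 are consistently elevated, as expected for a regional gain.
tumor_cell = [1.0, 1.0, 1.0, 1.0, 1.6, 1.7, 1.5, 1.6, 1.0, 1.0, 1.0, 1.0]
signal = cnv_signal(tumor_cell, reference, window=3)
```

Smoothing over many adjacent genes is what separates a regional copy number effect from gene-specific regulation, which is also why resolution is limited to large segments.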
2. Significance of Copy Number Variation Analysis
Single-cell CNV analysis is of great significance in tumor research:
- Distinguishing tumor cells from non-tumor cells: Tumor cells are usually accompanied by extensive genomic instability, manifested as widespread CNV. By analyzing the CNV patterns of individual cells, malignant tumor cells can be effectively distinguished from normal cells mixed in tumor tissue (such as immune cells, stromal cells, etc.).
- Revealing tumor clonal heterogeneity: Tumor tissue is composed of subclones with different genomic characteristics. Single-cell CNV analysis can identify tumor subclones with different CNV patterns, thereby revealing the clonal structure and evolutionary relationships within tumors.
- Identifying key tumor driver genes: Through CNV analysis, chromosomal regions with frequent copy number increases or decreases can be located. These regions may contain key oncogenes or tumor suppressor genes, providing clues for finding new therapeutic targets.
- Assessing tumor evolution and drug resistance mechanisms: By comparing CNV patterns of tumor cells at different treatment stages or different metastatic sites, the evolutionary path of tumors can be tracked, and genomic variations related to drug resistance can be studied.
3. Differences Between InferCNV and CopyKAT and Selection Recommendations
InferCNV and CopyKAT are two bioinformatics tools widely used for inferring CNV from scRNA-seq data. They differ in algorithm principles, applicable scenarios, and result focus.
| Tool | Core Principle | Advantages | Disadvantages | Applicable Scenarios and Recommendations |
|---|---|---|---|---|
| InferCNV | Infers CNV by comparing gene expression profiles of tumor cells with a set of "normal" reference cells, using sliding windows along the genome to smooth gene expression. | Classic algorithm with reliable results. Focuses on large-scale, chromosome arm-level variations. | Requires high-quality normal reference cells. Relatively low resolution. High computational load and long run times. Results require user interpretation and cell population classification. | Suitable when clear and reliable normal cells are available as reference (such as adjacent normal tissue or immune cells) and the focus is on large-scale variations. |
| CopyKAT | Uses an integrative Bayesian approach: expression is ordered by genomic position and segmented to estimate copy number, and a mixture model is used to identify likely normal cells as an internal baseline. | No need to predefine reference cells; they are identified automatically. Higher resolution (~5 Mb). High computational efficiency. Automatically predicts the malignant/normal status of cells. | Accuracy depends on the presence of sufficient normal cells in the data as an internal reference. For tumors with complex internal heterogeneity, automatic subclone classification may be oversimplified. | Suitable when no clear normal controls are available or when you want to distinguish tumor and normal cells automatically. Its automation makes it the preferred choice for exploratory analysis. |
Selection Recommendations:
- If your data contains clear and reliable normal cells as reference (for example, non-epithelial cells from adjacent normal tissue, or clearly annotated immune cells), and you are more concerned with large-scale chromosome arm-level variations, InferCNV is a classic and reliable choice.
- If your data does not have clear normal cell controls, or you want the algorithm to automatically distinguish tumor and normal cells, and are interested in higher-resolution CNV events, CopyKAT is the better choice. Its higher degree of automation makes results easier to interpret.
- In actual analysis, both tools can be used simultaneously for mutual validation to obtain more reliable conclusions.
Regulatory Network Analysis
1. Definition of Regulatory Network Analysis
Regulatory network analysis is an important class of methods for inferring inter-gene regulatory relationships and identifying functional modules from single-cell transcriptome data. By integrating gene expression correlations, transcription factor binding site information, and cell type-specific expression patterns, it constructs cell-specific gene regulatory networks to reveal the transcriptional regulatory mechanisms driving cell state transitions and functional differentiation.
Typical analysis approaches include: identifying co-regulated gene modules based on gene co-expression patterns (such as hdWGCNA), inferring regulatory relationships between transcription factors and target genes (such as SCENIC), and revealing the biological significance of modules through functional enrichment analysis.
2. Significance of Regulatory Network Analysis
- Revealing transcriptional regulatory mechanisms: By constructing gene regulatory networks, key transcription factors and their regulated target genes can be identified, revealing the transcriptional regulatory mechanisms behind cell state transitions.
- Identifying functional gene modules: Through gene co-expression network analysis, gene modules that are co-expressed in specific cell types can be identified, providing clues for understanding cell functions.
- Discovering cell type-specific regulatory factors: By analyzing the activity differences of regulons in different cell types, cell type-specific key regulatory factors can be identified.
- Understanding the molecular basis of cell heterogeneity: Regulatory network analysis helps understand the gene regulatory differences behind cell heterogeneity, providing molecular basis for cell classification and functional annotation.
- Guiding target screening and mechanistic research: By identifying key regulatory modules and regulatory factors, candidate gene sets can be provided for subsequent functional validation experiments and drug target screening.
- Integrating multi-omics information: Regulatory network analysis can integrate multi-omics information such as gene expression, epigenetics, and protein interactions, providing more comprehensive regulatory mechanism analysis.
TIP
Regulatory network analysis is particularly suitable for studying biological questions with complex regulatory relationships, such as:
- Molecular mechanisms of cell fate determination
- Regulatory network reconstruction of disease-related genes
- Drug mechanism of action and drug resistance research
- Dynamic changes in transcriptional regulation during development
3. What are the regulatory network analysis methods? How should I choose?
| Tool | Core Principle | Advantages | Disadvantages | Applicable Scenarios and Recommendations |
|---|---|---|---|---|
| hdWGCNA | Gene co-expression network + module identification. Constructs weighted gene co-expression networks based on gene expression correlations and identifies functionally related gene modules through hierarchical clustering. | Modular analysis: identifies cell type-specific functional gene modules. Eigengene extraction: quantifies module activity through module eigengenes. Functional enrichment: reveals the biological significance of modules in combination with GO/KEGG enrichment analysis. | Indirect regulation: based mainly on co-expression, so regulatory relationships cannot be verified directly. Parameter sensitive: results are affected by parameters such as the soft threshold. | First choice. When you want to identify co-expressed gene modules in cell types and understand their functions, hdWGCNA is a mature and reliable choice. |
| SCENIC | Regulatory network inference + regulon activity assessment. Infers regulatory relationships between transcription factors and target genes from gene co-expression patterns, validates them with motif analysis, and calculates regulon activity with the AUCell algorithm. | Direct regulation: infers direct regulatory relationships between transcription factors and target genes. Regulon activity quantification: quantifies regulon activity in each cell through AUC values. Cell state identification: groups and annotates cells based on regulatory network activity. | Computationally complex: the workflow is relatively complex and requires substantial computational resources. High data quality requirements: demands high-quality input data. | When you want to understand transcriptional regulatory mechanisms in depth and identify key transcription factors and their target genes, SCENIC is the best choice. |
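The "soft threshold" mentioned for hdWGCNA refers to raising correlations to a power so that weak correlations are suppressed. A minimal sketch of an unsigned weighted adjacency, a_ij = |cor(x_i, x_j)|^β, is shown below on made-up values; the choice of β = 6 is only an illustrative default, and the sketch omits the metacell construction and module detection steps.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency: |correlation| raised to the power beta."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes for h in genes if g != h
    }

# Hypothetical expression of three genes across six (meta)cells.
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "G2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],  # tracks G1 closely
    "G3": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # largely unrelated to G1
}
adjacency = soft_threshold_adjacency(expr, beta=6)
```

Raising β pushes weak correlations toward zero faster than strong ones, which is why module membership can shift when the soft threshold changes.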
Gene Set Scoring Analysis
1. Definition of Gene Set Scoring Analysis
Gene set scoring analysis is a method used in single-cell transcriptome data to assess the activity of predefined gene sets. By comprehensively scoring the activity of each cell or cell population, this method can quantify the enrichment degree of specific biological pathways, functions, or states. This scoring mechanism provides a powerful tool for revealing intrinsic cell heterogeneity and deeply exploring biological differences under different cell states.
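The simplest form of such a score is "mean expression of the gene set minus mean expression of a control set". The sketch below illustrates that idea only; it assumes a hand-picked control set, whereas tools such as Seurat's AddModuleScore select expression-matched control genes automatically. The gene sets and values are made up.

```python
from statistics import mean

def module_score(cell, gene_set, control_genes):
    """Gene set mean minus control mean for one cell (dict: gene -> expression)."""
    return mean(cell[g] for g in gene_set) - mean(cell[g] for g in control_genes)

# Hypothetical normalized expression for two cells.
cells = {
    "cell_1": {"CD3D": 2.5, "CD3E": 2.0, "CD2": 1.5, "ACTB": 1.0, "GAPDH": 1.0},
    "cell_2": {"CD3D": 0.0, "CD3E": 0.1, "CD2": 0.2, "ACTB": 1.1, "GAPDH": 0.9},
}
t_cell_set = ["CD3D", "CD3E", "CD2"]
controls = ["ACTB", "GAPDH"]  # assumption: chosen by hand for this example

scores = {name: module_score(cell, t_cell_set, controls) for name, cell in cells.items()}
```

Subtracting a control mean makes the score relative to the cell's overall expression level, so a positive value indicates the gene set is expressed above background in that cell.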
2. Significance of Gene Set Scoring Analysis
In single-cell research, gene set scoring analysis is valuable in the following ways:
- Functional annotation: Can functionally annotate unknown cell populations, revealing their biological roles.
- State comparison: Can compare activity changes of specific biological pathways under different experimental conditions or cell types.
- Heterogeneity exploration: Helps discover functional heterogeneity within the same cell population, identifying subpopulations in different states through activity differences.
- Biological insights: Provides key molecular-level explanations for complex biological processes such as disease mechanisms, cell differentiation, and drug responses.
3. What are the gene set scoring analysis methods? How should I choose?
| Tool | Core Principle | Advantages | Disadvantages | Applicable Scenarios and Recommendations |
|---|---|---|---|---|
| scMetabolism | Based on VISION and AUCell algorithms, with 78 built-in KEGG and REACTOME metabolic pathways. | Specifically optimized for metabolic pathways, results are intuitive and easy to interpret. | Limited to 78 predefined metabolic pathways, not applicable to other functional gene sets. | Focus on cellular metabolic pathway analysis. Recommended when you want to quickly study metabolic reprogramming at the single-cell level. |
| GSVA | Uses non-parametric, unsupervised methods to transform gene expression matrices into gene set enrichment score matrices. | Wide applicability, can be used for any custom or public gene sets; suitable for cross-sample, cross-condition pathway activity comparisons. | Biological interpretation of results depends on gene set quality; may not be sensitive enough for sparse single-cell data. | General gene set variation analysis. Recommended when comprehensive, unbiased pathway analysis based on large public databases like MSigDB is needed. |
| Scoring | Integrates multiple commonly used scoring algorithms such as AUCell, UCell, singscore, AddModuleScore. | High flexibility, allows users to use custom gene sets and compare results from different algorithms; helpful for cross-validation. | Requires users to have some understanding of the principles and applicability of different algorithms to choose the most appropriate method. | Flexible gene set scoring tool. Recommended when users have custom gene sets or want to cross-validate results through multiple different algorithms. |
| Feature Analysis | AddModuleScore. Uses the Seurat package's AddModuleScore function to calculate gene set scores in each cell. | Easy to use, intuitive results, adjustable plotting styles. | May be affected by gene set size and expression levels. | |
Perturbation Analysis
1. Definition of Perturbation Analysis
Perturbation analysis refers to a class of methods in single-cell transcriptomics that treat experimentally imposed or naturally observed differences in external or internal conditions (such as disease states, drug treatments, gene knockout/overexpression, or time points) as perturbations, and evaluate how strongly the transcriptional profiles of cell populations or specific cell types respond to them. The goal is to identify the cell types or gene modules that respond most strongly across conditions, thereby revealing potential regulatory and functional differences.
2. Significance of Perturbation Analysis
- Identifying key responsive cell types: By comparing transcriptional profiles under different conditions, identify cell types most sensitive to perturbations, providing targets for subsequent functional validation.
- Revealing condition-specific mechanisms: Identify genes and pathways that are differentially expressed or have altered activity under perturbations, helping understand molecular regulatory mechanisms.
- Prioritization and resource allocation: In complex tissues or diverse cell populations, perturbation analysis can be used to prioritize cell subpopulations that need in-depth experimental validation, thereby saving experimental resources.
- Guiding intervention strategies and biomarker discovery: By identifying cell types and genes sensitive to treatment/processing, it can provide basis for precision therapy and biomarker development.
3. How to Set Perturbation Factors and Perturbation Objects
Perturbation factor: Usually corresponds to a column in the `metadata` of Seurat/SingleCellExperiment objects, whose values are used to define different experimental conditions or groups (such as `treatment`, `disease_status`, `timepoint`, etc.). Setting recommendations:
- Ensure the column contains clear, mutually exclusive group labels with at least two different levels (e.g., treatment vs control, patient vs healthy).
- Avoid Chinese characters or special characters in label naming; use short English or underscore-connected names (such as `treated`, `control`, `timepoint_0`, `timepoint_24h`).
- Continuous variables (such as dose or time) can be discretized into several groups for comparison, or analyzed with methods that specifically support continuous perturbations.
Perturbation objects: Refers to the target groups to be compared under perturbation factors, which can be the entire sample set, specified cell types, or finer subpopulations (cluster/subcluster). Setting recommendations:
- Select at least two perturbation objects (e.g., `treated` vs `control`), and ensure sufficient cell numbers in each group to guarantee robustness of statistical/machine learning methods (recommended >100 cells per group, depending on the method).
- When the analysis goal is cell type sensitivity, it is recommended to run perturbation analysis individually by cell type or subpopulation to obtain response priority for each cell type.
- For data with batch effects or inter-sample differences, consider batch as a covariate during setup or perform batch correction in the preprocessing stage.
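The setting recommendations above can be checked before running an analysis. The sketch below is a hypothetical helper, not part of SeekSoulOnline or any package: it takes the perturbation factor column as a plain list of labels and reports missing values, fewer than two levels, labels with spaces or special characters, and groups at or below the recommended 100 cells.

```python
import re
from collections import Counter

# Letters, digits, and underscores only (assumption: a conservative naming rule).
SAFE_LABEL = re.compile(r"[A-Za-z0-9_]+")

def check_perturbation_factor(labels, min_cells=100):
    """Return per-group cell counts and a list of human-readable problems."""
    problems = []
    present = [lab for lab in labels if lab not in (None, "")]
    missing = len(labels) - len(present)
    if missing:
        problems.append(f"{missing} cells have no label")
    counts = Counter(present)
    if len(counts) < 2:
        problems.append("need at least two group levels")
    for label, n in counts.items():
        if not SAFE_LABEL.fullmatch(label):
            problems.append(f"label {label!r} contains spaces or special characters")
        if n <= min_cells:
            problems.append(f"group {label!r} has only {n} cells (recommended > {min_cells})")
    return counts, problems

# Hypothetical metadata column for 305 cells.
labels = ["treated"] * 150 + ["control"] * 120 + ["time point 24h"] * 30 + [None] * 5
counts, problems = check_perturbation_factor(labels)
```

In this example the helper reports three problems: the unlabeled cells, the badly named `time point 24h` level, and its small group size.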
TIP
Recommended workflow: First confirm the perturbation factor column and grouping in the full data or target samples, then run perturbation analysis individually on biologically relevant cell types with sufficient cell numbers to obtain robust cell type priority rankings.
CAUTION
- Avoid directly interpreting perturbation results on cell types with very few cells; small samples can lead to model instability or overfitting.
- If the perturbation factor contains missing values or imbalanced distribution in metadata, data cleaning should be performed first or consider downsampling/upsampling strategies.
- Tool recommendations: Use Augur to prioritize different cell types; it is also recommended to combine with differential expression, regulatory networks, cell communication and other methods for cross-validation to improve result credibility.
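The idea behind prioritizing cell types, as Augur does, is to ask how well condition labels can be predicted from expression within each cell type and to summarize that as an AUC. The sketch below illustrates the principle only: it swaps in a nearest-centroid classifier with a single fixed train/test split, whereas Augur itself relies on a more robust classifier with repeated subsampling and cross-validation. All data and helper names are made up.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of expression vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]

def auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def separability(train, test, positive="treated"):
    """Fit class centroids on train, score held-out cells, return the AUC."""
    pos_c = centroid([x for x, y in train if y == positive])
    neg_c = centroid([x for x, y in train if y != positive])
    # Higher score = closer to the positive centroid than to the negative one.
    scores = [math.dist(x, neg_c) - math.dist(x, pos_c) for x, _ in test]
    return auc(scores, [y == positive for _, y in test])

# Hypothetical two-gene expression vectors, split into train and test cells.
data = {
    "T_cell": {  # strong response to treatment
        "train": [([1.0, 1.0], "control"), ([1.2, 0.9], "control"),
                  ([3.0, 3.1], "treated"), ([3.2, 2.9], "treated")],
        "test": [([0.9, 1.1], "control"), ([1.1, 1.0], "control"),
                 ([2.9, 3.0], "treated"), ([3.1, 3.2], "treated")],
    },
    "B_cell": {  # essentially no response
        "train": [([1.0, 1.0], "control"), ([1.2, 0.9], "control"),
                  ([1.1, 0.9], "treated"), ([0.9, 1.0], "treated")],
        "test": [([0.9, 1.1], "control"), ([1.1, 1.0], "control"),
                 ([1.0, 1.1], "treated"), ([1.2, 1.0], "treated")],
    },
}
aucs = {ct: separability(d["train"], d["test"]) for ct, d in data.items()}
priority = sorted(aucs, key=aucs.get, reverse=True)
```

An AUC close to 1 indicates that the conditions are easy to tell apart in that cell type, while values around 0.5 or below suggest no detectable response; with only a handful of cells, as here, such estimates are unstable, which is the reason for the minimum cell number recommendation.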
